read.delim() is used to read the tab-separated data. Thereafter, the required columns are selected and the first column (City) is used as the row names. The dataset is then scaled to prepare it for the heatmap.
library(plotly)
library(seriation)
library(GGally)
library(tidyverse)
pricesnearn <- read.delim("prices-and-earnings.txt", stringsAsFactors = FALSE)
pricesnearn <- pricesnearn[, c(1, 2, 5, 6, 7, 9, 10, 16, 17, 18, 19)]  # keep only the required columns
row.names(pricesnearn) <- pricesnearn$City
scaledprices <- scale(pricesnearn[, -1])  # drop the City column before scaling
Without reordering, only very small clusters are visible and outliers are difficult to spot.
m <- list(
l = 10,
r = 10,
b = 50,
pad = 2
)
heatmapp <- function(df, str) {
  plot_ly(width = 900, height = 1200,
          x = colnames(df),
          y = rownames(df),
          z = df,
          type = "heatmap",
          colors = colorRamp(c("yellow", "red")),
          colorbar = list(title = "Color-Scale")) %>%
    layout(title = str, margin = m)
}
heatmapp(scaledprices,"Heatmap of Countries")
Comparing the two heatmaps generated from the Euclidean distance and from 1-correlation as a distance, where both compute orders that optimize the Hamiltonian path length and use hierarchical clustering (HC, via optimal leaf ordering) as the optimization algorithm, the heatmap based on the Euclidean distance appears to have better clustering. Although the permutation of rows is the same in both methods, the columns are arranged differently. With the Euclidean method, more similar clusters are visible between countries because the colour gradient within clusters is smaller.
With 1-correlation, similar clusters are visible for the Bread, Big Mac and iPhone prices among various countries around the centre-left part of the map, while with the Euclidean distance more similar clusters are visible around the top quarter of the map, at the centre left and at the bottom right.
dist_cols<-dist(t(scaledprices),method="euclidean",diag=FALSE)
dist_rows<-dist(scaledprices,method="euclidean",diag=FALSE)
order1_euc<-seriate(dist_rows,"OLO")
order2<-seriate(dist_cols,"OLO")
ord1<-get_order(order1_euc)
ord2<-get_order(order2)
scaledprices_euc <- scaledprices[rev(ord1),ord2]
heatmapp(scaledprices_euc,"Heatmap using Euclidean Distances")
corr_rows<-1-cor(t(scaledprices))
corr_cols<-1-cor(scaledprices)
order1<-seriate(as.dist(corr_rows),"OLO")
order2<-seriate(as.dist(corr_cols),"OLO")
ord1<-get_order(order1)
ord2<-get_order(order2)
scaledprices_corr <- scaledprices[ord1,ord2]
heatmapp(scaledprices_corr,"Heatmap using 1-Correlation as Distances")
A heatmap of the dataset is also generated by permuting rows and columns to minimize the Hamiltonian path length, using TSP as the optimizer. The rows and columns in this heatmap are arranged differently from the heatmap generated by HC. Additionally, the clusters produced by TSP are visually more homogeneous than those produced by HC.
order1_tsp<-seriate(dist_rows,"TSP")
order2<-seriate(dist_cols,"TSP")
ord1<-get_order(order1_tsp)
ord2<-get_order(order2)
TSP_solv <- criterion(dist_rows, order1_tsp, method = c("Path_length", "Gradient_weighted", "Gradient_raw"))
scaledprices_euc <- scaledprices[rev(ord1),ord2]
heatmapp(scaledprices_euc,"Heatmap-TSP using Hamiltonian Path Length")
Here we compare how efficient TSP is against hierarchical clustering as an optimizer for clustering, on the basis of different loss/merit functions.
##                            HC          TSP
## Path_length          121.9671     121.8256
## Gradient_weighted 156151.0604  107595.7915
## Gradient_raw       65528.0000   42136.0000
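For completeness, the HC column of the table above can be reproduced from the OLO ordering computed in the Euclidean section. A minimal sketch, assuming `dist_rows`, `order1_euc` and `TSP_solv` from the chunks above are still in scope:

```r
# Evaluate the same loss/merit functions for the HC (OLO) ordering,
# then stack both orderings for a side-by-side comparison.
HC_solv <- criterion(dist_rows, order1_euc,
                     method = c("Path_length", "Gradient_weighted", "Gradient_raw"))
rbind(HC = HC_solv, TSP = TSP_solv)
```

Note that `criterion()` is a merit/loss evaluator: a smaller path length and a smaller raw gradient indicate a better ordering here, which is why TSP wins on both gradient measures.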
Here there are two prominent clusters, which can be identified based on whether the variable "iPhone" has a value greater than or less than 56. The clusters can be coloured red and blue.
Visually, the clusters become much clearer if the variables are arranged in the following order from left to right: Big Mac, Bread, iPhone 4S, Rice (in kg), Hours Worked, Food Cost, Clothing Index, Wage Net, Goods and Services, Vacation.
pricesnearn$PhoneColor<-ifelse(pricesnearn$iPhone.4S.hr. > 56 ,1 ,0)
p <- pricesnearn %>% plot_ly(width=1000,height=600,type = "parcoords",
line = list(color = ~PhoneColor,
colorscale = list(c(0,'red'),c(1,'blue'))),
dimensions = list(
list(label = "Food Cost", values = ~ Food.Costs...),
list(label = "Iphone4S", values = ~ iPhone.4S.hr.),
list(label = "Clothing Index", values = ~ Clothing.Index),
list(label = "Hours Worked", values = ~ Hours.Worked),
list(label = "Wage Net", values = ~ Wage.Net),
list(label = "Vacation Days", values = ~ Vacation.Days),
list(label = "Big Mac", values = ~ Big.Mac.min.),
list(label = "Bread(in kg)", values = ~ Bread.kg.in.min.),
list(label = "Rice(in kg)", values = ~ Rice.kg.in.min.),
list(label = "Goods and Services", values = ~Goods.and.Services...)
)) %>% layout(title="Parallel Plot")
p
library(scales)
Ps <- list()
scaledprices_euc <- as.data.frame(scaledprices_euc)
nPlot <- nrow(scaledprices_euc)
# rownames_to_column() replaces the deprecated add_rownames()
scaledprices_radar <- scaledprices_euc %>% tibble::rownames_to_column(var = "City")
for ( i in 1:nPlot) {
Ps[[i]]<-htmltools::tags$div(
plot_ly(type="scatterpolar",
r=as.numeric(scaledprices_radar[i,-1]),
theta= colnames(scaledprices_radar)[-1],
fill="toself") %>%
layout(title=scaledprices_radar$City[i]),style="width: 25%;")
}
h<-htmltools::tags$div(style="display: flex; flex-wrap: wrap; ",Ps)
htmltools::browsable(h)
The first cluster includes Tokyo, Seoul, Hong Kong and Toronto. When the charts are stacked on top of each other, it can be seen that the variables for Tokyo are not similar to those of the other countries. Food Cost, Clothing Index and Goods and Services can be considered outliers for Tokyo with respect to the other countries.
The second cluster includes Chicago, Los Angeles, Miami and Tel Aviv. The "Rice (in kg)" variable for Tel Aviv can be considered an outlier.
# scaledprices_corr has only 10 columns, so the original column index -11 was
# out of bounds; all scaled variables are plotted instead. A small helper
# avoids repeating the same add_trace() call for every city.
radar_city <- function(p, city) {
  add_trace(p,
            r = as.numeric(scaledprices_corr[city, ]),
            theta = colnames(scaledprices_corr),
            name = city)
}
plot_ly(type = 'scatterpolar', fill = "toself") %>%
  radar_city("Tokyo") %>%
  radar_city("Seoul") %>%
  radar_city("Hong Kong") %>%
  radar_city("Toronto")
plot_ly(type = 'scatterpolar', fill = "toself") %>%
  radar_city("Chicago") %>%
  radar_city("Los Angeles") %>%
  radar_city("Miami") %>%
  radar_city("Tel Aviv")
### 7. Plots for analyzing data

Among heatmaps, parallel coordinate plots and radar charts, heatmaps seem best for analysing the similarity of variables and objects, thanks to the visible colour gradient. With parallel coordinate plots it becomes difficult to trace all the lines, and sometimes patterns are hard to see. With radar charts it can be difficult to judge orientation. Radar charts and parallel coordinate plots are, however, good at identifying outliers.